Stock Market Analysis and Prediction¶

The project aims to analyze stock prices from historical data (Nairobi Securities Exchange, 2007 to 2012) to understand patterns, predict future movements, and support data-driven investment decisions. It focuses on technical analysis: the study of historical market data such as prices, volumes, and trends. Specifically, the goal is to predict the closing price of stocks from a set of key predictors.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline


#html export
import plotly.io as pio
pio.renderers.default = 'notebook'
In [2]:
# import our data
df_2007 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2007.csv')
df_2008 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2008.csv')
df_2009 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2009.csv')
df_2010 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2010.csv')
df_2011 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2011.csv')
df_2012 = pd.read_csv(r'Nairobi Securities Exchange All Stocks Prices 2007-2012\NSE_data_all_stocks_2012.csv')
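The six near-identical `pd.read_csv` calls can also be generated in a loop. A minimal sketch, assuming the same directory and file-name pattern as above:

```python
from pathlib import Path

# Directory holding the yearly CSVs (same folder name as in the paths above)
data_dir = Path('Nairobi Securities Exchange All Stocks Prices 2007-2012')

# Build one path per year instead of six copy-pasted lines
paths = {year: data_dir / f'NSE_data_all_stocks_{year}.csv' for year in range(2007, 2013)}

# Reading would then be a single dict comprehension:
# dfs_by_year = {year: pd.read_csv(path) for year, path in paths.items()}
```

Using `pathlib` also sidesteps the backslash-escaping concerns that motivate the raw strings above.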

Columns and Description¶

  • DATE: The date of the stock information.
  • CODE: The stock or index code (e.g., ^NASI for NSE All-Share Index).
  • NAME: The name of the stock or index.
  • 12m Low: The lowest price in the last 12 months.
  • 12m High: The highest price in the last 12 months.
  • Day Low: The lowest price for the day.
  • Day High: The highest price for the day.
  • Day Price: The closing price for the day.
  • Previous: The previous closing price.
  • Change: The difference between the day's closing price and the previous closing price.
  • Change%: The percentage change in price (note: in the printed rows below, the values appear to be computed against the day's closing price and reported without sign).
  • Volume: The number of shares traded during the day.
  • Adjust: Adjustment data for the stock (e.g., for splits or dividends).
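As a quick sanity check on these definitions, Change and Change% can be reproduced from Day Price and Previous. This sketch uses values from the df_2007 preview; the denominator choice is an observation from the printed rows, not official NSE documentation:

```python
# Kapchorua Tea on 1/2/2007: Day Price 100, Previous 98.5 (values from the df_2007 preview)
day_price, previous = 100.0, 98.5

change = day_price - previous            # 1.5, matching the Change column
change_pct = abs(change) / day_price     # 0.015, printed as 1.50% in the table

print(round(change, 2), f'{change_pct:.2%}')
```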
In [3]:
df_2007
Out[3]:
DATE CODE NAME 12m Low 12m High Day Low Day High Day Price Previous Change Change% Volume Adjust
0 1/2/2007 EGAD Eaagads 22 57 52 52 52 52 - - 300 26
1 1/2/2007 KAPC Kapchorua Tea 111 148 100 100 100 98.5 1.5 1.50% 7,800 -
2 1/2/2007 KUKZ Kakuzi 67.5 89 42.5 45 43.5 42.25 1.25 2.87% 2,700 -
3 1/2/2007 REA Rea Vipingo Plantations 14.5 23.5 25 26.5 25.5 25.75 -0.25 0.98% 10,500 -
4 1/2/2007 SASN Sasini Tea and Coffee 10.5 13.6 140 145 140 141 -1 0.71% 70,900 23.33
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10603 12/24/2007 EVRD Eveready East Africa 1.4 3 7 8.1 7.95 7.65 0.3 3.77% 84,200 -
10604 12/24/2007 MSC Mumias Sugar 4 7.35 13.7 15 14.8 14.3 0.5 3.38% 2,400,000 -
10605 12/24/2007 UNGA Unga Group 9.6 16.5 15 15.5 15.45 15.5 -0.05 0.32% 20,300 12.88
10606 12/24/2007 ACCS AccessKenya Group 3.9 10.1 21.5 24.75 23.25 22.5 0.75 3.23% 1,090,000 -
10607 12/24/2007 ^N20I NSE 20-Share Index 3,396.83 5,030.91 5,444.83 5,444.83 5,444.83 5,339.75 105.08 1.93% 12,530,000 -

10608 rows × 13 columns

In [4]:
df_2008
Out[4]:
Date CODE NAME 12m Low 12m High Day Low Day High Day Price Previous Change Change% Volume Adjusted
0 1/2/2008 ^N20I NSE 20-Share Index 3,396.83 5,030.91 5,167.18 5,167.18 5,167.18 5,444.83 -277.65 5.37% 4,790,000 -
1 1/2/2008 ACCS AccessKenya Group 3.9 10.1 21 25 21.25 23.25 -2 9.41% 538,100 -
2 1/2/2008 UNGA Unga Group 9.6 16.5 15 15 15 15.45 -0.45 3.00% 25,000 12.5
3 1/2/2008 MSC Mumias Sugar 4 7.35 13.35 14.2 13.6 14.8 -1.2 8.82% 617,200 -
4 1/2/2008 EVRD Eveready East Africa 1.4 3 7.2 8 7.7 7.95 -0.25 3.25% 23,900 -
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10578 12/31/2008 C&G Car and General (K) 21 29 44 44 44 44 - - 200 29.33
10579 12/31/2008 WTK Williamson Tea Kenya 180 290 57.5 57.5 57.5 52.5 5 8.70% 1,000 -
10580 12/31/2008 SASN Sasini Tea and Coffee 10.5 13.6 7 7.35 7 6.7 0.3 4.29% 26,700 -
10581 12/31/2008 REA Rea Vipingo Plantations 14.5 23.5 13.5 14 13.95 13.65 0.3 2.15% 1,100 -
10582 12/31/2008 KAPC Kapchorua Tea 111 148 68 68 68 70 -2 2.94% 500 -

10583 rows × 13 columns

In [5]:
df_2009
Out[5]:
DATE CODE NAME 12m Low 12m High Day Low Day High Day Price Previous Change Change% Volume Adjust
0 1/2/2009 ^N20I NSE 20-Share Index 3,396.83 5,030.91 3,589.16 3,589.16 3,589.16 3,521.18 67.98 1.89% 3,750,000 -
1 1/2/2009 ^NASI NSE All-Share Index 73.71 121.58 74.2 74.2 74.2 73.37 0.83 1.12% 3,750,000 -
2 1/2/2009 SCOM Safaricom 3.05 6.5 3.6 3.75 3.6 3.6 - - 1,220,000 -
3 1/2/2009 ACCS AccessKenya Group 3.9 10.1 21.25 22.75 22.25 20.75 1.5 6.74% 200,700 -
4 1/2/2009 UNGA Unga Group 9.6 16.5 12.85 13.6 12.9 13.6 -0.7 5.43% 4,500 -
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14381 12/31/2009 REA Rea Vipingo Plantations 14.5 23.5 11.5 11.5 11.5 11.25 0.25 2.17% 200 -
14382 12/31/2009 LIMT Limuru Tea 400 500 305 305 305 305 - - 0 -
14383 12/31/2009 KUKZ Kakuzi 67.5 89 31.75 31.75 31.75 31.75 - - 0 -
14384 12/31/2009 KAPC Kapchorua Tea 111 148 86 86 86 86 - - 0 -
14385 12/31/2009 EGAD Eaagads 22 57 20 20 20 20 - - 0 -

14386 rows × 13 columns

In [6]:
df_2010
Out[6]:
DATE CODE NAME 12m Low 12m High Day Low Day High Day Price Previous Change Change% Volume Adjust
0 1/4/2010 ^NASI NSE All-Share Index 74.68 122.14 71.89 71.89 71.89 71.64 0.25 0.35% 3,730,000 -
1 1/4/2010 ^N20I NSE 20-Share Index 3,443.94 5,030.91 3,261.17 3,261.17 3,261.17 3,247.44 13.73 0.42% 3,730,000 -
2 1/4/2010 KPLC-P7 Kenya Power and Lighting 7% 5.5 5.5 5.5 5.5 5.5 5.5 - - 0 -
3 1/4/2010 KPLC-P4 Kenya Power and Lighting 4% 7.65 7.65 7.65 7.65 7.65 7.65 - - 0 -
4 1/4/2010 SCOM Safaricom 3.1 6.6 4.55 4.65 4.55 4.55 - - 2,800,000 -
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14947 12/31/2010 REA Rea Vipingo Plantations 14.5 23.5 15.8 17.5 17.45 17 0.45 2.58% 155,500 -
14948 12/31/2010 LIMT Limuru Tea 400 500 300 300 300 300 - - 0 -
14949 12/31/2010 KUKZ Kakuzi 67.5 89 80 82 81.5 77 4.5 5.52% 67,500 -
14950 12/31/2010 KAPC Kapchorua Tea 111 148 100 100 100 100 - - 0 -
14951 12/31/2010 EGAD Eaagads 22 57 50 50 50 50 - - 0 -

14952 rows × 13 columns

In [7]:
df_2011
Out[7]:
DATE CODE NAME 12m Low 12m High Day Low Day High Day Price Previous Change Change% Volume Adjust
0 1/3/2011 EGAD Eaagads 22 57 50 53.5 52.5 50 2.5 4.76% 1,200 -
1 1/3/2011 KAPC Kapchorua Tea 111 148 100 100 100 100 - - 0 -
2 1/3/2011 KUKZ Kakuzi 67.5 89 81.5 81.5 81.5 81.5 - - 0 -
3 1/3/2011 LIMT Limuru Tea 400 500 300 300 300 300 - - 0 -
4 1/3/2011 REA Rea Vipingo Plantations 14.5 23.5 17 18.55 18.45 17.45 1 5.42% 25,000 -
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14926 12/30/2011 SCOM Safaricom 3.05 6.5 2.9 3.05 2.95 3 -0.05 1.69% 1,590,000 -
14927 12/30/2011 KPLC-P4 Kenya Power and Lighting 4% 7.65 7.65 7.65 7.65 7.65 7.65 - - 0 -
14928 12/30/2011 KPLC-P7 Kenya Power and Lighting 7% 5.5 5.5 5.5 5.5 5.5 5.5 - - 0 -
14929 12/30/2011 ^N20I NSE 20-Share Index 3,396.83 5,030.91 3,205.02 3,205.02 3,205.02 3,160.03 44.99 1.40% 4,710,000 -
14930 12/30/2011 ^NASI NSE All-Share Index 73.71 121.58 68.03 68.03 68.03 67.85 0.18 0.26% 4,710,000 -

14931 rows × 13 columns

In [8]:
df_2012
Out[8]:
DATE CODE NAME 12m Low 12m High Day Low Day High Day Price Previous Change Change% Volume Adjust
0 1/3/2012 ^NASI NSE All-Share Index 73.23 119.37 68.38 68.38 68.38 68.03 0.35 0.51% 10,990,000 -
1 1/3/2012 ^N20I NSE 20-Share Index 3,312.56 4,985.91 3,212.86 3,212.86 3,212.86 3,205.02 7.84 0.24% 10,990,000 -
2 1/3/2012 KPLC-P7 Kenya Power and Lighting 7% 5.5 5.5 5.5 5.5 5.5 5.5 - - 0 -
3 1/3/2012 KPLC-P4 Kenya Power and Lighting 4% 7.65 7.65 7.65 7.65 7.65 7.65 - - 0 -
4 1/3/2012 SCOM Safaricom 3.05 6.5 2.9 3 2.95 2.95 - - 7,410,000 -
... ... ... ... ... ... ... ... ... ... ... ... ... ...
16039 12/31/2012 REA Rea Vipingo Plantations 14.5 23.5 19 19 19 19 - - 1,000 -
16040 12/31/2012 LIMT Limuru Tea 400 500 430 430 430 430 - - 0 -
16041 12/31/2012 KUKZ Kakuzi 67.5 89 72 72 72 69 3 4.17% 1,000 -
16042 12/31/2012 KAPC Kapchorua Tea 111 148 118 118 118 118 - - 100 -
16043 12/31/2012 EGAD Eaagads 22 57 25 25 25 25 - - 0 -

16044 rows × 13 columns

Preparing the dataset¶

In [9]:
# List of all dataframes
dfs = [df_2007, df_2008, df_2009, df_2010, df_2011, df_2012]

# Loop through each dataframe
for df in dfs:
    # Rename 'Adjust' to 'Adjusted' for consistency
    if 'Adjust' in df.columns:
        df.rename(columns={'Adjust': 'Adjusted'}, inplace=True)

    # Rename 'Date' to 'DATE' for consistency
    if 'Date' in df.columns:
        df.rename(columns={'Date': 'DATE'}, inplace=True)
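Since `DataFrame.rename` silently ignores mapping keys that are absent from the frame, the existence checks are optional; a sketch of the collapsed version on toy frames:

```python
import pandas as pd

# Toy frames mimicking the 2008 file (Date/Adjusted) and the other years (DATE/Adjust)
frames = [pd.DataFrame(columns=['Date', 'Adjusted']),
          pd.DataFrame(columns=['DATE', 'Adjust'])]

# rename() ignores keys not present in the frame, so no 'in df.columns' checks are needed
for frame in frames:
    frame.rename(columns={'Adjust': 'Adjusted', 'Date': 'DATE'}, inplace=True)

print([list(f.columns) for f in frames])  # both now ['DATE', 'Adjusted']
```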
In [10]:
# Combine all the dataframes from 2007 to 2012
combined_df = pd.concat([df_2007, df_2008, df_2009, df_2010, df_2011, df_2012])

# Reset the index to keep it clean
combined_df.reset_index(drop=True, inplace=True)
In [11]:
df = combined_df.copy()
df
Out[11]:
DATE CODE NAME 12m Low 12m High Day Low Day High Day Price Previous Change Change% Volume Adjusted
0 1/2/2007 EGAD Eaagads 22 57 52 52 52 52 - - 300 26
1 1/2/2007 KAPC Kapchorua Tea 111 148 100 100 100 98.5 1.5 1.50% 7,800 -
2 1/2/2007 KUKZ Kakuzi 67.5 89 42.5 45 43.5 42.25 1.25 2.87% 2,700 -
3 1/2/2007 REA Rea Vipingo Plantations 14.5 23.5 25 26.5 25.5 25.75 -0.25 0.98% 10,500 -
4 1/2/2007 SASN Sasini Tea and Coffee 10.5 13.6 140 145 140 141 -1 0.71% 70,900 23.33
... ... ... ... ... ... ... ... ... ... ... ... ... ...
81499 12/31/2012 REA Rea Vipingo Plantations 14.5 23.5 19 19 19 19 - - 1,000 -
81500 12/31/2012 LIMT Limuru Tea 400 500 430 430 430 430 - - 0 -
81501 12/31/2012 KUKZ Kakuzi 67.5 89 72 72 72 69 3 4.17% 1,000 -
81502 12/31/2012 KAPC Kapchorua Tea 111 148 118 118 118 118 - - 100 -
81503 12/31/2012 EGAD Eaagads 22 57 25 25 25 25 - - 0 -

81504 rows × 13 columns

In [12]:
# make column names uniform
df.columns = df.columns.str.lower().str.replace(' ', '_')
In [13]:
df.columns
Out[13]:
Index(['date', 'code', 'name', '12m_low', '12m_high', 'day_low', 'day_high',
       'day_price', 'previous', 'change', 'change%', 'volume', 'adjusted'],
      dtype='object')
In [14]:
numerical_columns = ['12m_low', '12m_high', 'day_low', 'day_high', 'day_price', 'previous', 'change', 'volume', 'adjusted']
In [15]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81504 entries, 0 to 81503
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       81504 non-null  object
 1   code       81504 non-null  object
 2   name       81456 non-null  object
 3   12m_low    81504 non-null  object
 4   12m_high   81504 non-null  object
 5   day_low    81504 non-null  object
 6   day_high   81504 non-null  object
 7   day_price  81504 non-null  object
 8   previous   81504 non-null  object
 9   change     81504 non-null  object
 10  change%    81504 non-null  object
 11  volume     81504 non-null  object
 12  adjusted   81504 non-null  object
dtypes: object(13)
memory usage: 8.1+ MB
In [16]:
# Converting specified columns

# Replace commas with empty strings in the selected columns
df[numerical_columns] = df[numerical_columns].replace({',': ''}, regex=True)

# Convert to numeric
df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric, errors='coerce')

# Remove '%' symbol, handle invalid values, and convert to decimal
df['change%'] = pd.to_numeric(df['change%'].str.replace('%', '', regex=True), errors='coerce') / 100
numerical_columns.append('change%')

# Convert the 'DATE' column to datetime
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y', errors='coerce')


df
Out[16]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume adjusted
0 2007-01-02 EGAD Eaagads 22.0 57.0 52.0 52.0 52.0 52.00 NaN NaN 300 26.00
1 2007-01-02 KAPC Kapchorua Tea 111.0 148.0 100.0 100.0 100.0 98.50 1.50 0.0150 7800 NaN
2 2007-01-02 KUKZ Kakuzi 67.5 89.0 42.5 45.0 43.5 42.25 1.25 0.0287 2700 NaN
3 2007-01-02 REA Rea Vipingo Plantations 14.5 23.5 25.0 26.5 25.5 25.75 -0.25 0.0098 10500 NaN
4 2007-01-02 SASN Sasini Tea and Coffee 10.5 13.6 140.0 145.0 140.0 141.00 -1.00 0.0071 70900 23.33
... ... ... ... ... ... ... ... ... ... ... ... ... ...
81499 2012-12-31 REA Rea Vipingo Plantations 14.5 23.5 19.0 19.0 19.0 19.00 NaN NaN 1000 NaN
81500 2012-12-31 LIMT Limuru Tea 400.0 500.0 430.0 430.0 430.0 430.00 NaN NaN 0 NaN
81501 2012-12-31 KUKZ Kakuzi 67.5 89.0 72.0 72.0 72.0 69.00 3.00 0.0417 1000 NaN
81502 2012-12-31 KAPC Kapchorua Tea 111.0 148.0 118.0 118.0 118.0 118.00 NaN NaN 100 NaN
81503 2012-12-31 EGAD Eaagads 22.0 57.0 25.0 25.0 25.0 25.00 NaN NaN 0 NaN

81504 rows × 13 columns
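The cleaning steps above can be verified on a toy frame; a minimal sketch of the comma-stripping and percent-parsing logic:

```python
import pandas as pd

toy = pd.DataFrame({'volume': ['12,530,000', '300'], 'change%': ['1.93%', '-']})

# Strip thousands separators, then coerce; non-numeric placeholders like '-' become NaN
toy['volume'] = pd.to_numeric(toy['volume'].str.replace(',', ''), errors='coerce')

# Drop the '%' sign and convert to a decimal fraction
toy['change%'] = pd.to_numeric(toy['change%'].str.replace('%', ''), errors='coerce') / 100

print(toy)
```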

In [17]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81504 entries, 0 to 81503
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       81504 non-null  datetime64[ns]
 1   code       81504 non-null  object        
 2   name       81456 non-null  object        
 3   12m_low    81504 non-null  float64       
 4   12m_high   81504 non-null  float64       
 5   day_low    81504 non-null  float64       
 6   day_high   81504 non-null  float64       
 7   day_price  81504 non-null  float64       
 8   previous   81492 non-null  float64       
 9   change     51141 non-null  float64       
 10  change%    51131 non-null  float64       
 11  volume     81504 non-null  int64         
 12  adjusted   20509 non-null  float64       
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 8.1+ MB

Data Cleaning¶

In [18]:
# duplicates
df.duplicated().sum()
Out[18]:
19
In [19]:
#dropping duplicates
df = df.drop_duplicates()
In [20]:
df.duplicated().sum()
Out[20]:
0
In [21]:
df.isna().sum()
Out[21]:
date             0
code             0
name            48
12m_low          0
12m_high         0
day_low          0
day_high         0
day_price        0
previous        12
change       30344
change%      30354
volume           0
adjusted     60979
dtype: int64
In [22]:
# dataframe where 'previous' column has nulls
df[df['previous'].isna()]
Out[22]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume adjusted
15053 2008-05-30 HFCK-R NaN 11.05 11.05 5.00 20.00 6.55 NaN 6.55 1.0 17200 NaN
22070 2009-02-02 KPLC-P7 Kenya Power and Lighting 7% 5.50 5.50 5.00 5.00 5.00 NaN 5.00 1.0 0 NaN
22079 2009-02-02 CARB Carbacid Investments 88.50 140.00 137.00 137.00 137.00 NaN 137.00 1.0 0 45.67
22080 2009-02-02 BOC B.O.C Kenya 90.00 130.00 160.00 160.00 160.00 NaN 160.00 1.0 0 NaN
22099 2009-02-02 UCHM Uchumi Supermarket 12.70 24.00 14.50 14.50 14.50 NaN 14.50 1.0 0 NaN
44813 2010-08-16 TPSE-R TPS Eastern Africa (Serena) (Rights) 5.10 5.10 5.00 8.00 5.40 NaN 5.40 1.0 700 NaN
45711 2010-09-07 SCBK-R Standard Chartered Bank (Rights) 51.00 89.00 70.00 70.00 70.00 NaN 70.00 1.0 500 NaN
49242 2010-12-01 KPLC-R Kenya Power and Lighting (Rights) 0.55 0.55 4.00 4.50 4.00 NaN 4.00 1.0 11100 NaN
58420 2011-07-14 TCL Trans-Century 20.00 37.75 50.00 60.00 57.00 NaN 57.00 1.0 1140000 NaN
69653 2012-04-03 KQ-R Kenya Airways (Rights) 0.05 0.05 0.05 0.05 0.05 NaN 0.05 1.0 228200 NaN
72010 2012-05-30 LKL Longhorn Kenya 8.50 21.00 15.00 20.00 16.15 NaN 16.15 1.0 4200 NaN
74136 2012-07-18 CIC CIC Insurance Group 3.05 7.00 4.00 7.00 5.25 NaN 5.25 1.0 1620000 NaN
In [23]:
# Drop rows where 'Previous' column is null
df = df[df['previous'].notna()]
In [24]:
df.isna().sum()
Out[24]:
date             0
code             0
name            47
12m_low          0
12m_high         0
day_low          0
day_high         0
day_price        0
previous         0
change       30344
change%      30354
volume           0
adjusted     60968
dtype: int64
In [25]:
# dataframe where 'name' column has nulls
null_name = df[df['name'].isna()]
null_name
Out[25]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume adjusted
7359 2007-09-05 OCH-R NaN 0.60 0.60 0.50 1.00 0.95 19.20 -18.25 19.2105 253000 NaN
7403 2007-09-06 OCH-R NaN 0.60 0.60 0.45 1.00 0.45 0.95 -0.50 1.1111 511800 NaN
7448 2007-09-07 OCH-R NaN 0.60 0.60 0.05 0.50 0.20 0.45 -0.25 1.2500 404800 NaN
7492 2007-09-10 OCH-R NaN 0.60 0.60 0.15 0.70 0.40 0.20 0.20 0.5000 272200 NaN
7537 2007-09-11 OCH-R NaN 0.60 0.60 0.20 1.00 0.20 0.40 -0.20 1.0000 5170000 NaN
7583 2007-09-12 OCH-R NaN 0.60 0.60 0.25 1.00 0.55 0.20 0.35 0.6364 329000 NaN
7627 2007-09-13 OCH-R NaN 0.60 0.60 0.10 1.00 0.40 0.55 -0.15 0.3750 3040000 NaN
7671 2007-09-14 OCH-R NaN 0.60 0.60 0.40 2.00 0.60 0.40 0.20 0.3333 3650000 NaN
8723 2007-10-22 NIC-R NaN 5.05 98.00 54.50 55.00 54.50 191.00 -136.50 2.5046 1000 NaN
8764 2007-10-23 NIC-R NaN 5.05 98.00 90.00 100.00 95.00 54.50 40.50 0.4263 5100 NaN
8807 2007-10-24 NIC-R NaN 5.05 98.00 95.00 100.00 99.00 95.00 4.00 0.0404 11200 NaN
8850 2007-10-25 NIC-R NaN 5.05 98.00 95.00 100.00 96.00 99.00 -3.00 0.0313 8500 NaN
8894 2007-10-26 NIC-R NaN 5.05 98.00 87.00 95.00 89.00 96.00 -7.00 0.0787 4600 NaN
8940 2007-10-29 NIC-R NaN 5.05 98.00 84.00 90.00 85.00 89.00 -4.00 0.0471 10700 NaN
8985 2007-10-30 NIC-R NaN 5.05 98.00 85.50 95.00 93.50 85.00 8.50 0.0909 5000 NaN
9029 2007-10-31 NIC-R NaN 5.05 98.00 93.00 100.00 95.00 93.50 1.50 0.0158 142100 NaN
9071 2007-11-01 NIC-R NaN 5.05 98.00 90.00 100.00 92.00 95.00 -3.00 0.0326 35500 NaN
9118 2007-11-02 NIC-R NaN 5.05 98.00 90.00 100.00 98.00 92.00 6.00 0.0612 13000 NaN
9195 2007-11-06 DTK-R NaN 2.30 50.00 9.00 9.00 9.00 15.70 -6.70 0.7444 4100 NaN
9239 2007-11-07 DTK-R NaN 2.30 50.00 26.50 26.50 26.50 9.00 17.50 0.6604 16000 NaN
9283 2007-11-08 DTK-R NaN 2.30 50.00 12.00 26.00 23.50 26.50 -3.00 0.1277 8800 NaN
9325 2007-11-09 DTK-R NaN 2.30 50.00 12.00 15.00 12.50 23.50 -11.00 0.8800 8600 NaN
9368 2007-11-12 DTK-R NaN 2.30 50.00 12.00 14.00 12.05 12.50 -0.45 0.0373 64100 NaN
15097 2008-06-03 HFCK-R NaN 11.05 11.05 8.00 11.00 10.05 6.55 3.50 0.3483 24200 NaN
15141 2008-06-04 HFCK-R NaN 11.05 11.05 8.00 15.00 11.85 10.05 1.80 0.1519 131900 NaN
15186 2008-06-05 HFCK-R NaN 11.05 11.05 11.00 15.00 13.05 11.85 1.20 0.0920 5820000 NaN
15230 2008-06-06 HFCK-R NaN 11.05 11.05 8.00 14.00 9.70 13.05 -3.35 0.3454 17540000 NaN
15272 2008-06-09 HFCK-R NaN 11.05 11.05 9.00 13.50 9.25 9.70 -0.45 0.0486 162500 NaN
15316 2008-06-10 HFCK-R NaN 11.05 11.05 9.55 11.00 10.45 9.25 1.20 0.1148 118200 NaN
15359 2008-06-11 HFCK-R NaN 11.05 11.05 10.00 12.00 10.40 10.45 -0.05 0.0048 137100 NaN
15404 2008-06-12 HFCK-R NaN 11.05 11.05 10.05 13.25 11.65 10.40 1.25 0.1073 254700 NaN
15448 2008-06-13 HFCK-R NaN 11.05 11.05 10.00 12.50 11.05 11.65 -0.60 0.0543 342600 NaN
15708 2008-06-23 KCB-R NaN 0.25 0.25 5.60 6.80 6.60 217.00 -210.40 31.8788 86700 NaN
15753 2008-06-24 KCB-R NaN 0.25 0.25 5.50 6.80 5.75 6.60 -0.85 0.1478 147700 NaN
15797 2008-06-25 KCB-R NaN 0.25 0.25 5.50 31.25 9.85 5.75 4.10 0.4162 76200 NaN
15843 2008-06-26 KCB-R NaN 0.25 0.25 4.35 30.00 6.35 9.85 -3.50 0.5512 93800 NaN
15888 2008-06-27 KCB-R NaN 0.25 0.25 1.00 34.75 5.25 6.35 -1.10 0.2095 1010000 NaN
15933 2008-06-30 KCB-R NaN 0.25 0.25 5.00 6.00 5.35 5.25 0.10 0.0187 109100 NaN
15979 2008-07-01 KCB-R NaN 0.25 0.25 5.05 6.50 5.35 5.35 NaN NaN 133700 NaN
16026 2008-07-02 KCB-R NaN 0.25 0.25 5.05 6.00 5.45 5.35 0.10 0.0183 239200 NaN
16074 2008-07-03 KCB-R NaN 0.25 0.25 4.35 31.25 5.05 5.45 -0.40 0.0792 3790000 NaN
16122 2008-07-04 KCB-R NaN 0.25 0.25 5.00 5.50 5.25 5.05 0.20 0.0381 165300 NaN
16170 2008-07-07 KCB-R NaN 0.25 0.25 3.00 5.35 5.15 5.25 -0.10 0.0194 246800 NaN
16218 2008-07-08 KCB-R NaN 0.25 0.25 4.80 5.30 5.00 5.15 -0.15 0.0300 45720000 NaN
16265 2008-07-09 KCB-R NaN 0.25 0.25 5.05 6.00 5.10 5.00 0.10 0.0196 13580000 NaN
16311 2008-07-10 KCB-R NaN 0.25 0.25 5.20 9.60 5.45 5.10 0.35 0.0642 448300 NaN
16357 2008-07-11 KCB-R NaN 0.25 0.25 5.50 7.00 5.95 5.45 0.50 0.0840 4720000 NaN
In [26]:
# getting associated codes for null names
null_name['code'].unique()
Out[26]:
array(['OCH-R', 'NIC-R', 'DTK-R', 'HFCK-R', 'KCB-R'], dtype=object)
In [27]:
# dataframe with associated codes for null names
null_name_codes = df[df['code'].isin(['OCH-R', 'NIC-R', 'DTK-R', 'HFCK-R', 'KCB-R'])]
null_name_codes
Out[27]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume adjusted
7359 2007-09-05 OCH-R NaN 0.60 0.6 0.50 1.0 0.95 19.20 -18.25 19.2105 253000 NaN
7403 2007-09-06 OCH-R NaN 0.60 0.6 0.45 1.0 0.45 0.95 -0.50 1.1111 511800 NaN
7448 2007-09-07 OCH-R NaN 0.60 0.6 0.05 0.5 0.20 0.45 -0.25 1.2500 404800 NaN
7492 2007-09-10 OCH-R NaN 0.60 0.6 0.15 0.7 0.40 0.20 0.20 0.5000 272200 NaN
7537 2007-09-11 OCH-R NaN 0.60 0.6 0.20 1.0 0.20 0.40 -0.20 1.0000 5170000 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
76209 2012-09-03 NIC-R NIC Bank (Rights) 5.05 98.0 9.05 13.0 10.40 10.00 0.40 0.0385 149900 NaN
76274 2012-09-04 NIC-R NIC Bank (Rights) 5.05 98.0 9.40 13.0 11.15 10.40 0.75 0.0673 121700 NaN
76339 2012-09-05 NIC-R NIC Bank (Rights) 5.05 98.0 12.50 15.0 14.55 11.15 3.40 0.2337 248100 NaN
76404 2012-09-06 NIC-R NIC Bank (Rights) 5.05 98.0 12.00 14.5 13.05 14.55 -1.50 0.1149 486500 NaN
76469 2012-09-07 NIC-R NIC Bank (Rights) 5.05 98.0 10.05 13.8 12.85 13.05 -0.20 0.0156 1290000 NaN

74 rows × 13 columns

In [28]:
# unique names for the associated codes
null_name_codes['name'].unique()
Out[28]:
array([nan, 'Kenya Commercial Bank (Rights)',
       'Diamond Trust Bank Kenya (Rights)', 'NIC Bank (Rights)'],
      dtype=object)
In [29]:
# Create the dictionary for replacements
# (no names appear in the data for OCH-R and HFCK-R, so these two are filled in from the codes)
replacement_dict = {
    'NIC-R': 'NIC Bank (Rights)',
    'DTK-R': 'Diamond Trust Bank Kenya (Rights)',
    'KCB-R': 'Kenya Commercial Bank (Rights)',
    'OCH-R': 'Olympia Capital Holdings Ltd (Rights)',
    'HFCK-R': 'HF Group Ltd (Rights)'
}
In [30]:
# Iterate through the replacement dictionary
for code, replacement_name in replacement_dict.items():
    # Update the 'NAME' column where the 'CODE' matches and 'NAME' is null
    df.loc[(df['code'] == code) & (df['name'].isna()), 'name'] = replacement_name
In [31]:
df.isna().sum()
Out[31]:
date             0
code             0
name             0
12m_low          0
12m_high         0
day_low          0
day_high         0
day_price        0
previous         0
change       30344
change%      30354
volume           0
adjusted     60968
dtype: int64
In [32]:
# dataframe where 'change' column has nulls
df[df['change'].isna()]
Out[32]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume adjusted
0 2007-01-02 EGAD Eaagads 22.00 57.00 52.0 52.0 52.0 52.0 NaN NaN 300 26.00
7 2007-01-02 C&G Car and General (K) 21.00 29.00 50.0 50.0 50.0 50.0 NaN NaN 66700 33.33
16 2007-01-02 NBK National Bank of Kenya 16.25 24.25 57.0 60.0 58.5 58.5 NaN NaN 84900 41.79
17 2007-01-02 NIC NIC Bank 22.95 58.00 101.0 105.0 102.0 102.0 NaN NaN 9800 23.22
18 2007-01-02 SCBK Standard Chartered Bank 160.00 320.00 202.0 210.0 205.0 205.0 NaN NaN 54400 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
81498 2012-12-31 SASN Sasini Tea and Coffee 10.50 13.60 11.7 12.0 11.7 11.7 NaN NaN 32200 NaN
81499 2012-12-31 REA Rea Vipingo Plantations 14.50 23.50 19.0 19.0 19.0 19.0 NaN NaN 1000 NaN
81500 2012-12-31 LIMT Limuru Tea 400.00 500.00 430.0 430.0 430.0 430.0 NaN NaN 0 NaN
81502 2012-12-31 KAPC Kapchorua Tea 111.00 148.00 118.0 118.0 118.0 118.0 NaN NaN 100 NaN
81503 2012-12-31 EGAD Eaagads 22.00 57.00 25.0 25.0 25.0 25.0 NaN NaN 0 NaN

30344 rows × 13 columns

In [33]:
# replacing nulls with zero to indicate no change
df['change'] = df['change'].fillna(0)
df['change%'] = df['change%'].fillna(0)
In [34]:
df.isna().sum()
Out[34]:
date             0
code             0
name             0
12m_low          0
12m_high         0
day_low          0
day_high         0
day_price        0
previous         0
change           0
change%          0
volume           0
adjusted     60968
dtype: int64
In [35]:
df['adjusted'].isna().sum()/len(df) * 100
Out[35]:
74.83215298319689

Approximately 75% of the adjusted column is null, so it is safe to drop this column.
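This decision generalizes to a threshold rule: drop any column whose null fraction is too high. A minimal sketch (the `drop_mostly_null` helper and the 0.75 cutoff are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

def drop_mostly_null(frame, threshold=0.75):
    """Drop columns whose fraction of nulls meets or exceeds `threshold`."""
    null_frac = frame.isna().mean()
    return frame.drop(columns=null_frac[null_frac >= threshold].index)

demo = pd.DataFrame({'keep': [1, 2, 3, 4],
                     'mostly_null': [np.nan, np.nan, np.nan, 4]})
print(drop_mostly_null(demo).columns.tolist())  # ['keep']; 75% of 'mostly_null' is NaN
```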

In [36]:
# Drop adjusted column
del df['adjusted']
In [37]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 81473 entries, 0 to 81503
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       81473 non-null  datetime64[ns]
 1   code       81473 non-null  object        
 2   name       81473 non-null  object        
 3   12m_low    81473 non-null  float64       
 4   12m_high   81473 non-null  float64       
 5   day_low    81473 non-null  float64       
 6   day_high   81473 non-null  float64       
 7   day_price  81473 non-null  float64       
 8   previous   81473 non-null  float64       
 9   change     81473 non-null  float64       
 10  change%    81473 non-null  float64       
 11  volume     81473 non-null  int64         
dtypes: datetime64[ns](1), float64(8), int64(1), object(2)
memory usage: 8.1+ MB
In [38]:
df
Out[38]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume
0 2007-01-02 EGAD Eaagads 22.0 57.0 52.0 52.0 52.0 52.00 0.00 0.0000 300
1 2007-01-02 KAPC Kapchorua Tea 111.0 148.0 100.0 100.0 100.0 98.50 1.50 0.0150 7800
2 2007-01-02 KUKZ Kakuzi 67.5 89.0 42.5 45.0 43.5 42.25 1.25 0.0287 2700
3 2007-01-02 REA Rea Vipingo Plantations 14.5 23.5 25.0 26.5 25.5 25.75 -0.25 0.0098 10500
4 2007-01-02 SASN Sasini Tea and Coffee 10.5 13.6 140.0 145.0 140.0 141.00 -1.00 0.0071 70900
... ... ... ... ... ... ... ... ... ... ... ... ...
81499 2012-12-31 REA Rea Vipingo Plantations 14.5 23.5 19.0 19.0 19.0 19.00 0.00 0.0000 1000
81500 2012-12-31 LIMT Limuru Tea 400.0 500.0 430.0 430.0 430.0 430.00 0.00 0.0000 0
81501 2012-12-31 KUKZ Kakuzi 67.5 89.0 72.0 72.0 72.0 69.00 3.00 0.0417 1000
81502 2012-12-31 KAPC Kapchorua Tea 111.0 148.0 118.0 118.0 118.0 118.00 0.00 0.0000 100
81503 2012-12-31 EGAD Eaagads 22.0 57.0 25.0 25.0 25.0 25.00 0.00 0.0000 0

81473 rows × 12 columns

EDA (Exploratory Data Analysis)¶

We take 'day_price' as the target variable.

In [39]:
numericals = ['12m_low', '12m_high', 'day_low', 'day_high',
       'day_price', 'previous', 'change', 'change%', 'volume']

categorical = ['code']
In [40]:
# trying to see the trend of 'day_price' across the years
fig = px.line(df, x='date', y='day_price')
fig.update_layout(title='Daily Day Price', title_x=0.5, title_y=0.95)  # title_x=0.5 centers the title
In [41]:
df2 = df.copy()

# Set the 'date' column as the index
df2.set_index('date', inplace=True)

# Resample 'day_price' by month-end ('ME'; the older 'M' alias is deprecated) and compute the mean
monthly_avg = df2['day_price'].resample('ME').mean()

# Create a line plot with monthly averages
fig = px.line(monthly_avg, x=monthly_avg.index, y=monthly_avg, labels={'x': 'Date', 'y': 'Average Day Price'})
fig.update_layout(title='Monthly Average Day Price', title_x=0.5, title_y=0.95)  # title_x=0.5 centers the title

# Show the plot
fig.show()

In [42]:
px.histogram(df, x='day_price')

∴ Prices follow a long-tailed distribution: most stocks trade at low prices, with only a few expensive outliers
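With a tail this long, a log transform makes the bulk of the distribution easier to inspect; a minimal numpy sketch (the sample values are illustrative prices from the tables above):

```python
import numpy as np

prices = np.array([2.95, 14.8, 52.0, 140.0, 5444.83])

# log1p = log(1 + x): compresses large values far more than small ones and is safe at 0
log_prices = np.log1p(prices)
print(np.round(log_prices, 2))
```

The same idea applies to the histogram itself: `px.histogram(df, x='day_price', log_y=True)` keeps the raw prices but puts the counts on a log axis.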

In [43]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots


# Create subplots for each numerical column
fig = make_subplots(rows=3, cols=3, subplot_titles=[f'Boxplot for {col}' for col in numericals])

# Loop through each numerical column and create a boxplot
for i, column in enumerate(numericals, 1):
    row = (i - 1) // 3 + 1  # Determine the row index (1 to 3)
    col = (i - 1) % 3 + 1   # Determine the column index (1 to 3)
    
    fig.add_trace(
        go.Box(
            y=df[column],  # Data for the boxplot
            name=column       # Name of the boxplot (will appear in the title)
        ),
        row=row, col=col       # Specify the location of the subplot
    )

# Update the layout to add a title and improve spacing
fig.update_layout(
    title_text="Boxplots for Numerical Columns",  # Title for the entire plot
    title_x=0.5,  # Center the title
    height=900,   # Adjust the height of the plot
    showlegend=False  # Hide the legend (not needed for boxplots)
)

# Show the plot
fig.show()
Handling Outliers¶

Winsorization reduces the effect of outliers by capping the extreme values in each numerical column at the 20th and 80th percentiles. This makes the data less sensitive to extreme outliers, though limits this aggressive also flatten the lowest and highest 20% of each column's values.
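A small example shows the effect, using the same `scipy.stats.mstats.winsorize` call and limits as the cell below:

```python
import numpy as np
from scipy.stats import mstats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# With limits=[0.2, 0.2] the two lowest values are raised to the 20th percentile
# and the two highest (including the 100 outlier) are lowered to the 80th percentile
capped = np.asarray(mstats.winsorize(data, limits=[0.2, 0.2]))
print(capped.tolist())
```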

In [44]:
from scipy.stats import mstats

df3 = df.copy()


# Winsorize the data for each numerical column
for column in numericals:
    df3[column] = mstats.winsorize(df3[column], limits=[0.2, 0.2])  # Replace outliers with 20th and 80th percentiles

# Create subplots for each numerical column
fig = make_subplots(rows=3, cols=3, subplot_titles=[f'Boxplot for {col}' for col in numericals])

# Loop through each numerical column and create a boxplot
for i, column in enumerate(numericals, 1):
    row = (i - 1) // 3 + 1  # Determine the row index (1 to 3)
    col = (i - 1) % 3 + 1   # Determine the column index (1 to 3)
    
    fig.add_trace(
        go.Box(
            y=df3[column],  # Data for the boxplot
            name=column    # Name of the boxplot (will appear in the title)
        ),
        row=row, col=col   # Specify the location of the subplot
    )

# Update the layout to add a title and improve spacing
fig.update_layout(
    title_text="Boxplots for Winsorized Numerical Columns",  # Title for the entire plot
    title_x=0.5,  # Center the title
    height=900,   # Adjust the height of the plot
    showlegend=False  # Hide the legend (not needed for boxplots)
)

# Show the plot
fig.show()

Model Training¶

In [67]:
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge
In [68]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
In [69]:
# Inspect the winsorized dataset
df3
Out[69]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume
0 2007-01-02 EGAD Eaagads 22.0 57.0 52.0 52.0 52.0 52.00 0.00 0.0000 300
1 2007-01-02 KAPC Kapchorua Tea 88.5 148.0 100.0 100.0 100.0 98.50 0.25 0.0150 7800
2 2007-01-02 KUKZ Kakuzi 67.5 89.0 42.5 45.0 43.5 42.25 0.25 0.0230 2700
3 2007-01-02 REA Rea Vipingo Plantations 14.5 23.5 25.0 26.5 25.5 25.75 -0.25 0.0098 10500
4 2007-01-02 SASN Sasini Tea and Coffee 10.5 13.6 125.0 129.0 127.0 127.00 -0.25 0.0071 70900
... ... ... ... ... ... ... ... ... ... ... ... ...
81499 2012-12-31 REA Rea Vipingo Plantations 14.5 23.5 19.0 19.0 19.0 19.00 0.00 0.0000 1000
81500 2012-12-31 LIMT Limuru Tea 88.5 148.0 125.0 129.0 127.0 127.00 0.00 0.0000 200
81501 2012-12-31 KUKZ Kakuzi 67.5 89.0 72.0 72.0 72.0 69.00 0.25 0.0230 1000
81502 2012-12-31 KAPC Kapchorua Tea 88.5 148.0 118.0 118.0 118.0 118.00 0.00 0.0000 200
81503 2012-12-31 EGAD Eaagads 22.0 57.0 25.0 25.0 25.0 25.00 0.00 0.0000 200

81473 rows × 12 columns

In [70]:
# Define X and Y
numericals = ['12m_low', '12m_high', 'day_low', 'day_high','previous','volume']

x = df3[numericals].values
y = df3['day_price'].values
In [71]:
# Split the dataset into training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Training and evaluating different models

In [77]:
# Scale the features so each has a mean of 0 and a standard deviation of 1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'KNN Regression': KNeighborsRegressor(n_neighbors=5),
    'Decision Tree Regression': DecisionTreeRegressor(),
}

# Evaluate models
results = []

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    results.append([name, mae, mse, rmse, r2])

# Display results
results_df = pd.DataFrame(results, columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2'])
results_df.set_index('Model', inplace=True)
results_df
Out[77]:
MAE MSE RMSE R2
Model
Linear Regression 0.175972 0.367902 0.606549 0.999823
Ridge Regression 0.176169 0.369004 0.607457 0.999822
KNN Regression 0.242779 0.673791 0.820848 0.999676
Decision Tree Regression 0.174512 0.436381 0.660591 0.999790
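These scores come from a single train/test split; a quick way to check they are stable is k-fold cross-validation, with scaling done inside each fold (via a `Pipeline`) so the scaler never sees test data. The sketch below uses synthetic data in place of the NSE features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the six numerical predictors and the target
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
y = X @ np.array([0.2, 0.1, 0.5, 0.5, 2.0, 0.0]) + rng.normal(scale=0.1, size=300)

# 5-fold cross-validated R²; the pipeline refits the scaler per fold,
# preventing information from the held-out fold leaking into training
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print("R² per fold:", np.round(scores, 4))
print(f"Mean R²: {scores.mean():.4f}")
```

On the real `df3` features one would pass `df3[numericals].values` and `df3['day_price'].values` in place of the synthetic `X` and `y`.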
In [82]:
# Plotly Bar graph
fig = px.bar(
    results_df, 
    x=results_df.index, 
    y=['MAE', 'MSE', 'RMSE', 'R2'], 
    barmode='group', 
    title="Model Evaluation Metrics", 
    labels={"value": "Score", "Model": "Model"},
    height=500
)

# Show the plot
fig.show()
In [78]:
# Find the best model based on the desired metric (R²)
best_model_r2 = results_df['R2'].idxmax()  # Model with highest R²
best_model_mse = results_df['MSE'].idxmin()  # Model with lowest MSE
best_model_mae = results_df['MAE'].idxmin()  # Model with lowest MAE
best_model_rmse = results_df['RMSE'].idxmin()  # Model with lowest RMSE

# Display the best models based on each metric
print(f"Best Model based on R²: {best_model_r2}")
print(f"Best Model based on MSE: {best_model_mse}")
print(f"Best Model based on MAE: {best_model_mae}")
print(f"Best Model based on RMSE: {best_model_rmse}")
Best Model based on R²: Linear Regression
Best Model based on MSE: Linear Regression
Best Model based on MAE: Decision Tree Regression
Best Model based on RMSE: Linear Regression

∴ Linear Regression is the best-performing model overall: it has the highest R² and the lowest MSE and RMSE, although the Decision Tree narrowly wins on MAE
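Since the features were standardized, the fitted linear model's coefficient magnitudes indicate each predictor's relative influence on the closing price. The sketch below (using synthetic data as a stand-in for `df3`, constructed so that `previous` dominates) shows how to pair coefficients with feature names:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

feature_names = ['12m_low', '12m_high', 'day_low', 'day_high', 'previous', 'volume']

# Synthetic data in which 'previous' (column 4) drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 5.0 * X[:, 4] + 0.1 * X[:, 2] + rng.normal(scale=0.01, size=500)

X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# On standardized features, larger |coefficient| means stronger influence
for name, coef in sorted(zip(feature_names, model.coef_), key=lambda t: -abs(t[1])):
    print(f"{name:10s} {coef:+.3f}")
```

Running the same inspection on the real model would reveal whether the near-perfect R² is driven mainly by `previous`, which is what one would expect for daily closing prices.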

In [80]:
# Initialize models
lr_model = LinearRegression()
dt_model = DecisionTreeRegressor()

# Fit models to the unscaled training data (both models are used on raw
# features from here on; tree splits and linear-regression predictions
# are unaffected by standardization)
lr_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lr = lr_model.predict(X_test)
y_pred_dt = dt_model.predict(X_test)

Error Plots

In [83]:
# Create a figure and axis
plt.figure(figsize=(14, 6))

# Plot predicted vs actual values for Linear Regression
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_lr, color='blue', alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')  # Line of perfect prediction
plt.title('Linear Regression: Predicted vs Actual')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

# Plot predicted vs actual values for Decision Tree Regressor
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_dt, color='green', alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')  # Line of perfect prediction
plt.title('Decision Tree Regressor: Predicted vs Actual')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

# Display the plots
plt.tight_layout()
plt.show()

Our models perform well: most points lie along the perfect-prediction line (predicted = actual, i.e. y = x).

Residual plots

In [81]:
import matplotlib.pyplot as plt

# Calculate residuals for Linear Regression
residuals_lr = y_test - y_pred_lr

# Calculate residuals for Decision Tree Regressor
residuals_dt = y_test - y_pred_dt

# Create residual plot for Linear Regression
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.scatter(y_pred_lr, residuals_lr, color='blue', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot for Linear Regression')
plt.xlabel('Predicted Values (Linear Regression)')
plt.ylabel('Residuals')

# Create residual plot for Decision Tree Regressor
plt.subplot(1, 2, 2)
plt.scatter(y_pred_dt, residuals_dt, color='green', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot for Decision Tree Regressor')
plt.xlabel('Predicted Values (Decision Tree Regressor)')
plt.ylabel('Residuals')

# Show the plots
plt.tight_layout()
plt.show()

Our models perform well: the residuals are scattered randomly and close to the zero line, with no visible pattern.

Using the Model¶

In [85]:
df3
Out[85]:
date code name 12m_low 12m_high day_low day_high day_price previous change change% volume
0 2007-01-02 EGAD Eaagads 22.0 57.0 52.0 52.0 52.0 52.00 0.00 0.0000 300
1 2007-01-02 KAPC Kapchorua Tea 88.5 148.0 100.0 100.0 100.0 98.50 0.25 0.0150 7800
2 2007-01-02 KUKZ Kakuzi 67.5 89.0 42.5 45.0 43.5 42.25 0.25 0.0230 2700
3 2007-01-02 REA Rea Vipingo Plantations 14.5 23.5 25.0 26.5 25.5 25.75 -0.25 0.0098 10500
4 2007-01-02 SASN Sasini Tea and Coffee 10.5 13.6 125.0 129.0 127.0 127.00 -0.25 0.0071 70900
... ... ... ... ... ... ... ... ... ... ... ... ...
81499 2012-12-31 REA Rea Vipingo Plantations 14.5 23.5 19.0 19.0 19.0 19.00 0.00 0.0000 1000
81500 2012-12-31 LIMT Limuru Tea 88.5 148.0 125.0 129.0 127.0 127.00 0.00 0.0000 200
81501 2012-12-31 KUKZ Kakuzi 67.5 89.0 72.0 72.0 72.0 69.00 0.25 0.0230 1000
81502 2012-12-31 KAPC Kapchorua Tea 88.5 148.0 118.0 118.0 118.0 118.00 0.00 0.0000 200
81503 2012-12-31 EGAD Eaagads 22.0 57.0 25.0 25.0 25.0 25.00 0.00 0.0000 200

81473 rows × 12 columns

In [90]:
# Select the features from a single row (here, row index 4)
features = ['12m_low', '12m_high', 'day_low', 'day_high', 'previous', 'volume']

X_single_row = df3.loc[4, features].values.reshape(1, -1)  # Reshape to 2D array

# Make a prediction using the trained model
y_pred_single = lr_model.predict(X_single_row)

# Get the actual price from the df3['day_price'] column for the same row
y_actual_single = df3.loc[4, 'day_price']

# Output the prediction and actual value
print(f"Predicted value : {y_pred_single[0]}")
print(f"Actual value : {y_actual_single}")
Predicted value : 126.91836872539568
Actual value : 127.0
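One refinement worth considering: bundling the scaler and the model into a single scikit-learn `Pipeline`, so that `predict()` always applies the same preprocessing that was used at fit time and no manual transform step can be forgotten. A minimal sketch, using synthetic data in place of `df3`:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the six predictors and the closing price
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = X @ np.array([0.1, 0.2, 0.3, 0.4, 2.0, 0.0]) + rng.normal(scale=0.05, size=200)

# The pipeline fits the scaler and the regressor together;
# predict() re-applies the fitted scaling automatically
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X[:150], y[:150])
preds = pipe.predict(X[150:])
print(preds.shape)
```

With a pipeline, the single-row prediction above would simply be `pipe.predict(X_single_row)`, with no risk of feeding unscaled features to a model trained on scaled ones.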